Lecture 14
Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection, model evaluation, and many other utilities.
The sklearn package contains a large number of submodules, each specialized for a different set of tasks or models:
sklearn.base - Base classes and utility functions
sklearn.calibration - Probability Calibration
sklearn.cluster - Clustering
sklearn.compose - Composite Estimators
sklearn.covariance - Covariance Estimators
sklearn.cross_decomposition - Cross decomposition
sklearn.datasets - Datasets
sklearn.decomposition - Matrix Decomposition
sklearn.discriminant_analysis - Discriminant Analysis
sklearn.ensemble - Ensemble Methods
sklearn.exceptions - Exceptions and warnings
sklearn.experimental - Experimental
sklearn.feature_extraction - Feature Extraction
sklearn.feature_selection - Feature Selection
sklearn.gaussian_process - Gaussian Processes
sklearn.impute - Impute
sklearn.inspection - Inspection
sklearn.isotonic - Isotonic regression
sklearn.kernel_approximation - Kernel Approximation
sklearn.kernel_ridge - Kernel Ridge Regression
sklearn.linear_model - Linear Models
sklearn.manifold - Manifold Learning
sklearn.metrics - Metrics
sklearn.mixture - Gaussian Mixture Models
sklearn.model_selection - Model Selection
sklearn.multiclass - Multiclass classification
sklearn.multioutput - Multioutput regression and classification
sklearn.naive_bayes - Naive Bayes
sklearn.neighbors - Nearest Neighbors
sklearn.neural_network - Neural network models
sklearn.pipeline - Pipeline
sklearn.preprocessing - Preprocessing and Normalization
sklearn.random_projection - Random projection
sklearn.semi_supervised - Semi-Supervised Learning
sklearn.svm - Support Vector Machines
sklearn.tree - Decision Trees
sklearn.utils - Utilities

To begin, we will examine a simple data set on the size and weight of a number of books. The goal is to model the weight of a book using some combination of the other features in the data.
The included columns are:
volume - book volumes in cubic centimeters
weight - book weights in grams
cover - a categorical variable with levels "hb" hardback, "pb" paperback
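As a sketch, the data can be represented as a pandas DataFrame (values transcribed from the tables shown later in these notes; in the original the data is read from a CSV file):

```python
import pandas as pd

# The books data: volume in cubic cm, weight in grams, and cover type
books = pd.DataFrame({
    "volume": [885, 1016, 1125, 239, 701, 641, 1228, 412, 953, 929,
               1492, 419, 1010, 595, 1034],
    "weight": [800, 950, 1050, 350, 750, 600, 1075, 250, 700, 650,
               975, 350, 950, 425, 725],
    "cover":  ["hb"] * 7 + ["pb"] * 8,
})
books.head()
```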
scikit-learn uses an object oriented system for implementing its various modeling approaches; the LinearRegression class is part of the linear_model submodule.
Each modeling class needs to be constructed (potentially with options) and then the resulting object will provide attributes and methods.
When fitting a model, scikit-learn expects X to be a 2d array-like object (e.g. a np.array or pd.DataFrame) but will not accept a pd.Series or 1d np.array.
Error: ValueError: Expected 2D array, got 1D array instead:
array=[ 885 1016 1125 239 701 641 1228 412 953 929 1492 419 1010 595
1034].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
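Following the error message, one fix is to reshape the 1d array into a single-column 2d array before fitting (a sketch; the data values are transcribed from the tables later in these notes):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

volume = np.array([885, 1016, 1125, 239, 701, 641, 1228, 412, 953,
                   929, 1492, 419, 1010, 595, 1034])
weight = np.array([800, 950, 1050, 350, 750, 600, 1075, 250, 700,
                   650, 975, 350, 950, 425, 725])

# reshape(-1, 1) turns the 1d array into a single-column 2d array,
# which fit() accepts; y may remain 1d
lm = LinearRegression().fit(volume.reshape(-1, 1), weight)
```

When working with a DataFrame, indexing with a list of columns (e.g. `books[["volume"]]`) also returns a 2d object and avoids the reshape.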
LinearRegression()
Depending on the model being used, there will be a number of parameters that can be configured when creating the model object or via the set_params() method.
LinearRegression(fit_intercept=False)
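A minimal sketch of the two equivalent ways of setting a parameter:

```python
from sklearn.linear_model import LinearRegression

# Options can be supplied when constructing the model object ...
lm = LinearRegression(fit_intercept=False)

# ... or afterwards via set_params(), which returns the estimator itself
lm = LinearRegression().set_params(fit_intercept=False)
```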
Once the model coefficients have been fit, predictions can be made via the predict() method. This method requires a matrix-like X as input and, in the case of LinearRegression, returns an array of predicted y values.
array([ 725.10251417, 832.43407276, 921.74048411, 195.81864507,
574.34673721, 525.18724472, 1006.13094621, 337.5618484 ,
780.81660565, 761.15280865, 1222.43271315, 343.29712253,
827.51812351, 487.49830048, 847.1819205 ])
volume weight cover weight_lm_pred
0 885 800 hb 725.102514
1 1016 950 hb 832.434073
2 1125 1050 hb 921.740484
3 239 350 hb 195.818645
4 701 750 hb 574.346737
5 641 600 hb 525.187245
6 1228 1075 hb 1006.130946
7 412 250 pb 337.561848
8 953 700 pb 780.816606
9 929 650 pb 761.152809
10 1492 975 pb 1222.432713
11 419 350 pb 343.297123
12 1010 950 pb 827.518124
13 595 425 pb 487.498300
14 1034 725 pb 847.181921
There is no built-in functionality for calculating residuals, so this needs to be done by hand.
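A sketch of computing residuals by hand, using the volume-only, no-intercept model whose predictions appear in the table above (data values transcribed from that table):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

volume = np.array([885, 1016, 1125, 239, 701, 641, 1228, 412, 953,
                   929, 1492, 419, 1010, 595, 1034]).reshape(-1, 1)
weight = np.array([800, 950, 1050, 350, 750, 600, 1075, 250, 700,
                   650, 975, 350, 950, 425, 725])

lm = LinearRegression(fit_intercept=False).fit(volume, weight)

# Residuals are just the observed values minus the fitted values
resid = weight - lm.predict(volume)
```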
Scikit-learn expects the model matrix to be numeric before fitting. The solution here is to dummy code the categorical variables - this can be done with pandas via pd.get_dummies() or with a scikit-learn preprocessor.
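A minimal sketch of the pandas route (a small toy frame, not the full books data):

```python
import pandas as pd

books = pd.DataFrame({
    "volume": [885, 412],
    "cover":  ["hb", "pb"],
})

# get_dummies() leaves numeric columns alone and expands only the
# categorical ones, one indicator column per level
dummies = pd.get_dummies(books)

# pd.get_dummies(books, drop_first=True) gives standard dummy coding
```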
Do the following results look reasonable? What went wrong?
d = read.csv('data/daag_books.csv')
d['cover_hb'] = ifelse(d$cover == "hb", 1, 0)
d['cover_pb'] = ifelse(d$cover == "pb", 1, 0)
lm = lm(weight~volume+cover_hb+cover_pb, data=d)
summary(lm)
Call:
lm(formula = weight ~ volume + cover_hb + cover_pb, data = d)
Residuals:
Min 1Q Median 3Q Max
-110.10 -32.32 -16.10 28.93 210.95
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.91557 59.45408 0.234 0.818887
volume 0.71795 0.06153 11.669 6.6e-08 ***
cover_hb 184.04727 40.49420 4.545 0.000672 ***
cover_pb NA NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 78.2 on 12 degrees of freedom
Multiple R-squared: 0.9275, Adjusted R-squared: 0.9154
F-statistic: 76.73 on 2 and 12 DF, p-value: 1.455e-07
These are a set of transformer classes present in the sklearn.preprocessing submodule that are designed to help with the preparation of raw feature data into quantities more suitable for downstream modeling tools.
Like the modeling classes, they have an object oriented design that shares a common interface (methods and attributes) for bringing in data, transforming it, and returning it.
For dummy coding we can use the OneHotEncoder preprocessor. The default is one hot encoding, but standard dummy coding can be achieved via the drop parameter.
OneHotEncoder(sparse_output=False)
array([[1., 0.],
[1., 0.],
[1., 0.],
[1., 0.],
[1., 0.],
[1., 0.],
[1., 0.],
[0., 1.],
[0., 1.],
[0., 1.],
[0., 1.],
[0., 1.],
[0., 1.],
[0., 1.],
[0., 1.]])
Unlike pd.get_dummies(), it is not safe to use OneHotEncoder with both numerical and categorical features, as the former will also be transformed.
enc = OneHotEncoder(sparse_output=False)
X = enc.fit_transform(
  X = books[["volume", "cover"]]
)
pd.DataFrame(
  data=X,
  columns = enc.get_feature_names_out()
)

   volume_239 volume_412 volume_419 ... volume_1492 cover_hb cover_pb
0 0.0 0.0 0.0 ... 0.0 1.0 0.0
1 0.0 0.0 0.0 ... 0.0 1.0 0.0
2 0.0 0.0 0.0 ... 0.0 1.0 0.0
3 1.0 0.0 0.0 ... 0.0 1.0 0.0
4 0.0 0.0 0.0 ... 0.0 1.0 0.0
5 0.0 0.0 0.0 ... 0.0 1.0 0.0
6 0.0 0.0 0.0 ... 0.0 1.0 0.0
7 0.0 1.0 0.0 ... 0.0 0.0 1.0
8 0.0 0.0 0.0 ... 0.0 0.0 1.0
9 0.0 0.0 0.0 ... 0.0 0.0 1.0
10 0.0 0.0 0.0 ... 1.0 0.0 1.0
11 0.0 0.0 1.0 ... 0.0 0.0 1.0
12 0.0 0.0 0.0 ... 0.0 0.0 1.0
13 0.0 0.0 0.0 ... 0.0 0.0 1.0
14 0.0 0.0 0.0 ... 0.0 0.0 1.0
[15 rows x 17 columns]
volume weight cover weight_lm2_pred
0 885 800 hb 833.351907
1 1016 950 hb 927.403847
2 1125 1050 hb 1005.660805
3 239 350 hb 369.553788
4 701 750 hb 701.248418
5 641 600 hb 658.171193
6 1228 1075 hb 1079.610041
7 412 250 pb 309.712515
8 953 700 pb 698.125490
9 929 650 pb 680.894600
10 1492 975 pb 1085.102558
11 419 350 pb 314.738191
12 1010 950 pb 739.048853
13 595 425 pb 441.098050
14 1034 725 pb 756.279743
Scikit-learn comes with a number of builtin functions for measuring model performance in the sklearn.metrics submodule - these are generally just functions that take the vectors y_true and y_pred and return a scalar score.
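A minimal sketch of this interface, using toy vectors rather than the books data:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Most metrics share the signature metric(y_true, y_pred) -> float
y_true = np.array([3.0, 1.0, 4.0, 1.5])
y_pred = np.array([2.5, 1.0, 4.5, 1.0])

mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
```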
Create and fit a model for the books data that includes an interaction effect between volume and cover.
You will need to do this manually with pd.get_dummies() and some additional data munging.
We will now look at another flavor of regression model that involves preprocessing and a hyperparameter - namely polynomial regression.
It is certainly possible to construct the necessary model matrix by hand (or even use a function to automate the process), but this is generally less than desirable - particularly if we want to do anything fancy (e.g. cross validation).
This is another transformer class from sklearn.preprocessing that simplifies the process of constructing polynomial features for your model matrix. Usage is similar to that of OneHotEncoder.
PolynomialFeatures(degree=3)
array([[ 1., 0., 0., 0.],
[ 1., 1., 1., 1.],
[ 1., 2., 4., 8.],
[ 1., 3., 9., 27.],
[ 1., 4., 16., 64.],
[ 1., 5., 25., 125.]])
array(['1', 'x0', 'x0^2', 'x0^3'], dtype=object)
If the feature matrix X has more than one column then the PolynomialFeatures transformer will include interaction terms with total degree up to degree.
array([[0, 1],
[2, 3],
[4, 5]])
array([[ 0., 1., 0., 0., 1., 0., 0., 0., 1.],
[ 2., 3., 4., 6., 9., 8., 12., 18., 27.],
[ 4., 5., 16., 20., 25., 64., 80., 100., 125.]])
array(['x0', 'x1', 'x0^2', 'x0 x1', 'x1^2', 'x0^3', 'x0^2 x1', 'x0 x1^2',
'x1^3'], dtype=object)
array([[0, 1, 2],
[3, 4, 5]])
array([[ 0., 1., 2., 0., 0., 0., 1., 2., 4.],
[ 3., 4., 5., 9., 12., 15., 16., 20., 25.]])
array(['x0', 'x1', 'x2', 'x0^2', 'x0 x1', 'x0 x2', 'x1^2', 'x1 x2',
'x2^2'], dtype=object)
You may have noticed that PolynomialFeatures takes a model matrix as input and returns a new model matrix as output, which is then used as the input for LinearRegression. This is not an accident; by structuring the library in this way, sklearn enables these steps to be connected together into what it calls a pipeline.
from sklearn.pipeline import make_pipeline
p = make_pipeline(
PolynomialFeatures(degree=4),
LinearRegression()
)
p

Pipeline(steps=[('polynomialfeatures', PolynomialFeatures(degree=4)),
                ('linearregression', LinearRegression())])
Once constructed, this object can be used just like our previous LinearRegression model (i.e. fit to our data and then used for prediction).
Pipeline(steps=[('polynomialfeatures', PolynomialFeatures(degree=4)),
                ('linearregression', LinearRegression())])
array([ 1.6295693 , 1.65734929, 1.6610466 , 1.67779767, 1.69667491,
1.70475286, 1.75280126, 1.78471392, 1.79049912, 1.82690007,
1.82966357, 1.83376043, 1.84494343, 1.86002819, 1.86228095,
1.86619112, 1.86837909, 1.87065283, 1.88417882, 1.8844024 ,
1.88527174, 1.88577463, 1.88544367, 1.86890805, 1.86365035,
1.86252922, 1.86047349, 1.85377801, 1.84937708, 1.83754576,
1.82623453, 1.82024199, 1.81799793, 1.79767794, 1.77255319,
1.77034143, 1.76574288, 1.75371272, 1.74389585, 1.73804309,
1.73356954, 1.65527727, 1.64812184, 1.61867613, 1.6041325 ,
1.5960389 , 1.56080881, 1.55036459, 1.54004364, 1.50903953,
1.45096594, 1.43589836, 1.41886389, 1.39423307, 1.36180712,
1.23072992, 1.21355164, 1.11776117, 1.11522002, 1.09595388,
1.06449719, 1.04672121, 1.03662739, 1.01407206, 0.98208703,
0.98081577, 0.96176797, 0.87491417, 0.87117573, 0.84223005,
0.84171166, 0.82875003, 0.8085086 , 0.79166069, 0.78167248,
0.78078036, 0.73538157, 0.7181484 , 0.70046945, 0.67233502,
0.67229069, 0.64782899, 0.64050946, 0.63726823, 0.63526047,
0.62323271, 0.61965166, 0.61705548, 0.6141438 , 0.60978056,
0.60347713, 0.5909255 , 0.566617 , 0.50905785, 0.44706202,
0.44177711, 0.43291379, 0.40957833, 0.38480262, 0.38288511,
0.38067928, 0.3791518 , 0.37610476, 0.36932957, 0.36493067,
0.35806518, 0.3475729 , 0.3466828 , 0.33332696, 0.30717941,
0.3006981 , 0.29675876, 0.29337641, 0.29333354, 0.27631567,
0.26899076, 0.2676092 , 0.2672602 , 0.26716133, 0.26241605,
0.25405246, 0.25334542, 0.25322869, 0.25322576, 0.25410989,
0.25622496, 0.25808334, 0.25849729, 0.26029845, 0.26043195,
0.26319956, 0.26466962, 0.26480578, 0.2648598 , 0.26488966,
0.28177285, 0.28525208, 0.28861016, 0.28917644, 0.29004253,
0.29444629, 0.29559749, 0.30233373, 0.30622039, 0.31322114,
0.31798208, 0.32104799, 0.32700307, 0.32822585, 0.32927281,
0.3326599 , 0.33397022, 0.33710573, 0.34110873, 0.34140708,
0.34707419, 0.35926445, 0.37678278, 0.37774536, 0.38884519,
0.39078249, 0.39517758, 0.40743395, 0.41040931, 0.42032703,
0.43577431, 0.46157615, 0.46668313, 0.47144763, 0.47196742,
0.47425178, 0.47510175, 0.47762453, 0.48381558, 0.48473821,
0.4906733 , 0.50202549, 0.50448149, 0.50674907, 0.50959756,
0.51456778, 0.51694399, 0.51848152, 0.52576027, 0.53292675,
0.53568264, 0.53601729, 0.53790775, 0.53878741, 0.53876248,
0.53838784, 0.53822688, 0.53756849, 0.53748661, 0.53650016,
0.53481469, 0.53372126, 0.53274257, 0.52871724, 0.52377536,
0.52346188, 0.52313791, 0.52286872, 0.49655523, 0.49552641,
0.47578596, 0.4669369 , 0.43757684, 0.38609879, 0.38104404,
0.31131919, 0.2984486 , 0.28774333, 0.27189053, 0.25239709,
0.2384553 , 0.22915234, 0.17792316, 0.17355182, 0.09982541,
0.09880754, 0.09413432, 0.09001771, 0.0844749 , 0.01787073,
-0.00849026, -0.03051945, -0.06842454, -0.09116713, -0.10695813,
-0.13889128, -0.20217854, -0.2210452 , -0.23334664, -0.39045798,
-0.46280636, -0.47155946, -0.48247123, -0.5697079 , -0.57972246,
-0.68977946, -0.81351875, -0.83477874, -0.88303201, -0.91521502,
-0.96937509, -0.99388351, -1.1634133 , -1.19336585, -1.21548881])
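The fit-and-predict pattern above can be sketched as follows (synthetic data standing in for the lecture's data set):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic noisy sine data as a stand-in for the real data set
rng = np.random.default_rng(1234)
x = rng.uniform(0, 1, 50).reshape(-1, 1)
y = np.sin(2 * np.pi * x.ravel()) + rng.normal(0, 0.1, 50)

p = make_pipeline(PolynomialFeatures(degree=4), LinearRegression())

p.fit(x, y)           # each step is fit/transformed in order
y_hat = p.predict(x)  # transforms are applied, then the model predicts
```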
The attributes of the individual steps are not directly accessible from the pipeline object, but can be reached via its steps or named_steps attributes.
By accessing each step we can adjust its parameters (via set_params()),
{'copy_X': True, 'fit_intercept': True, 'n_jobs': None, 'positive': False}
LinearRegression(fit_intercept=False)
Pipeline(steps=[('polynomialfeatures', PolynomialFeatures(degree=4)),
                ('linearregression', LinearRegression(fit_intercept=False))])
0.0
array([ 1.61366366, 7.39051417, -57.67175293, 102.72227443,
-55.38181361])
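A sketch of this step-access pattern:

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

p = make_pipeline(PolynomialFeatures(degree=4), LinearRegression())

# steps is a list of (name, estimator) tuples; named_steps is a
# dict-like view keyed by the lowercased class names
first_name = p.steps[0][0]
lr = p.named_steps["linearregression"]
lr.set_params(fit_intercept=False)
```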
These parameters can also be directly accessed at the pipeline level, note how the names are constructed:
{'memory': None, 'steps': [('polynomialfeatures', PolynomialFeatures(degree=4)), ('linearregression', LinearRegression(fit_intercept=False))], 'verbose': False, 'polynomialfeatures': PolynomialFeatures(degree=4), 'linearregression': LinearRegression(fit_intercept=False), 'polynomialfeatures__degree': 4, 'polynomialfeatures__include_bias': True, 'polynomialfeatures__interaction_only': False, 'polynomialfeatures__order': 'C', 'linearregression__copy_X': True, 'linearregression__fit_intercept': False, 'linearregression__n_jobs': None, 'linearregression__positive': False}
Pipeline(steps=[('polynomialfeatures',
                 PolynomialFeatures(degree=4, include_bias=False)),
                ('linearregression', LinearRegression())])
Pipeline(steps=[('polynomialfeatures',
                 PolynomialFeatures(degree=4, include_bias=False)),
                ('linearregression', LinearRegression())])
1.6136636604768375
array([ 7.39051417, -57.67175293, 102.72227443, -55.38181361])
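Setting a nested parameter at the pipeline level can be sketched as:

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

p = make_pipeline(PolynomialFeatures(degree=4), LinearRegression())

# Nested parameters use the pattern <step name>__<parameter name>
p.set_params(polynomialfeatures__include_bias=False)
```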
We’ve already seen a manual approach to tuning models over the degree parameter; scikit-learn also has built-in tools to aid with this process. Here we will leverage GridSearchCV to tune the degree parameter in our pipeline.
from sklearn.model_selection import GridSearchCV, KFold
p = make_pipeline(
PolynomialFeatures(include_bias=True),
LinearRegression(fit_intercept=False)
)
grid_search = GridSearchCV(
estimator = p,
param_grid = {"polynomialfeatures__degree": range(1,10)},
scoring = "neg_root_mean_squared_error",
cv = KFold(shuffle=True)
)

GridSearchCV(cv=KFold(n_splits=5, random_state=None, shuffle=True),
             estimator=Pipeline(steps=[('polynomialfeatures',
                                        PolynomialFeatures()),
                                       ('linearregression',
                                        LinearRegression(fit_intercept=False))]),
             param_grid={'polynomialfeatures__degree': range(1, 10)},
             scoring='neg_root_mean_squared_error')
cv_results_

array([-0.55749641, -0.55411766, -0.53337083, -0.46185477, -0.28410974,
       -0.2837936 , -0.26782574, -0.22542718, -0.21997084])
array([9, 8, 7, 6, 5, 4, 3, 2, 1], dtype=int32)
array([0.00097075, 0.00087657, 0.00086861, 0.00086889, 0.00087271,
0.0008781 , 0.00087924, 0.00088468, 0.00088801])
dict_keys(['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time', 'param_polynomialfeatures__degree', 'params', 'split0_test_score', 'split1_test_score', 'split2_test_score', 'split3_test_score', 'split4_test_score', 'mean_test_score', 'std_test_score', 'rank_test_score'])
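A self-contained sketch of running the search and retrieving the best results (synthetic data standing in for the lecture's data set; the attribute names are standard GridSearchCV attributes):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic noisy sine data as a stand-in for the real data set
rng = np.random.default_rng(1234)
x = rng.uniform(0, 1, 50).reshape(-1, 1)
y = np.sin(2 * np.pi * x.ravel()) + rng.normal(0, 0.1, 50)

p = make_pipeline(
    PolynomialFeatures(include_bias=True),
    LinearRegression(fit_intercept=False)
)

grid_search = GridSearchCV(
    estimator=p,
    param_grid={"polynomialfeatures__degree": range(1, 6)},
    scoring="neg_root_mean_squared_error",
    cv=KFold(shuffle=True, random_state=1234),
).fit(x, y)

# Useful attributes of the fitted search object:
best_degree = grid_search.best_params_["polynomialfeatures__degree"]
best_model = grid_search.best_estimator_  # pipeline refit on all the data
```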
Sta 663 - Spring 2023